Comparing machine learning approaches to predict SEEG accuracy

Stereoencephalography (SEEG) is a technique used in drug-resistant epilepsy patients that may be a candidate for surgical resection of the epileptogenic zone. Multiple electrodes are placed using a so-called "frame based" stereotactic approach, in our case using the Leksell frame. In our previous paper "Methodology, outcome, safety and in vivo accuracy in traditional frame-based stereoelectroencephalography" by Van der Loo et al (2017) we reported on SEEG electrode implantation accuracy in a cohort of 71 patients who were operated between September 2008 and April 2016, in whom a total of 902 electrodes were implanted. Data for in vivo application accuracy analysis were available for 866 electrodes.

The goal of the current project is to use a public version of this dataset (without any personal identifiers) to predict electrode implantation accuracy by using and comparing different machine learning approaches.

Pieter Kubben, MD, PhD
neurosurgeon @ Maastricht University Medical Center, The Netherlands

For any questions you can reach me by email or on Twitter.

Data description

The public dataset contains these variables:

  • PatientPosition: patient position during surgery (nominal: supine, prone)
  • Contacts: nr of contacts of electrode implanted (ordinal: 5, 8, 10, 12, 15, 18)
  • ElectrodeType: describes trajectory type (nominal: oblique, orthogonal). Oblique refers to implantation using the Leksell arc, and orthogonal using a dedicated L-piece mounted on the frame (mostly implants in temporal lobe) when arc angles become too high (approx > 155°) or too low (approx < 25°)
  • PlanningX: planned Cartesian X coord of target (numeric, in mm)
  • PlanningY: planned Cartesian Y coord of target (numeric, in mm)
  • PlanningZ: planned Cartesian Z coord of target (numeric, in mm)
  • PlanningRing: planned ring coord, the trajectory direction in sagittal plane (numeric, in degrees); defines entry
  • PlanningArc: planned arc coord (trajectory direction in coronal plane (numeric, in degrees); defines entry
  • DuraTipDistancePlanned: distance from dura mater (outer sheet covering the brain surface) to target (numeric, in mm)
  • EntryX: real Cartesian X coord of entry point (numeric, in mm)
  • EntryY: real Cartesian Y coord of entry point (numeric, in mm)
  • EntryZ: real Cartesian Z coord of entry point (numeric, in mm)
  • TipX: real Cartesian X coord of target point (numeric, in mm)
  • TipY: real Cartesian Y coord of target point (numeric, in mm)
  • TipZ: real Cartesian Z coord of target point (numeric, in mm)
  • SkinSkullDistance: distance between skin surfacce and skull surface (numeric, in mm)
  • SkullThickness: skull thickness (numeric, in mm)
  • SkullAngle: insertion angle of electrode relative to skull (numeric, in degrees)
  • ScrewLength: length of bone screw used to guide and fixate electrode (ordinal: 20, 25, 30, 35 mm)

The electrodes are the Microdeep depth electrodes by DIXI Medical.

To the limited extent possible in this case I tried to make these FAIR data and adhere to FAIR guiding principles. In practice this meant I introduced the topic, described my data and created a DOI.

Now let's get started.


In [90]:
# import libraries
import turicreate as tc
import h5py
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
plt.style.use('ggplot')
%matplotlib inline
import warnings; warnings.simplefilter('ignore')
#%xmode plain; # shorter error messages
pd.options.mode.chained_assignment = None 
# global setting whether to save figures or not
# will save as 300 dpi PNG - all filenames start with "fig_"
save_figures = False

In [2]:
# load data
electrodes = pd.read_csv('../data/electrodes_public.csv')

In [3]:
# find missing values
nan_rows = sum([True for idx,row in electrodes.iterrows() if any(row.isnull())])
print('Nr of rows with missing values:', nan_rows)


Nr of rows with missing values: 823

In [4]:
def missing_values_table(df): 
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum()/len(df)
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        mis_val_table = mis_val_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        return mis_val_table

In [5]:
missing_values_table(electrodes)


Out[5]:
Missing Values % of Total Values
PatientPosition 0 0.000000
Contacts 56 6.466513
ElectrodeType 0 0.000000
PlanningX 0 0.000000
PlanningY 0 0.000000
PlanningZ 0 0.000000
PlanningRing 271 31.293303
PlanningArc 271 31.293303
DuraTipDistancePlanned 54 6.235566
EntryX 279 32.217090
EntryY 119 13.741339
EntryZ 131 15.127021
TipX 0 0.000000
TipY 0 0.000000
TipZ 0 0.000000
SkinSkullDistance 0 0.000000
SkullThickness 0 0.000000
SkullAngle 0 0.000000
ScrewLength 540 62.355658

In [6]:
#not used features
electrodes.drop(['EntryX', 'EntryY', 'EntryZ','ScrewLength','PlanningRing','PlanningArc'], axis = 1, inplace = True)

In [7]:
# calculate TPLE and remove entry data from dataframe
electrodes['TPLE'] = np.sqrt(np.square(electrodes['TipX'] - electrodes['PlanningX']) + 
                              np.square(electrodes['TipY'] - electrodes['PlanningY']) + 
                              np.square(electrodes['TipZ'] - electrodes['PlanningZ'])
                             ).round(1)

In [8]:
#plt.figure(figsize=[16, 8])
#sns.distplot(electrodes['TPLE'], hist=False, rug=True)
#sns.distplot(electrodes['TPLE'])

In [9]:
plt.figure(figsize=[16, 8])
sns.kdeplot(electrodes[electrodes['PatientPosition']=='Supine']['TPLE'], label='Supine')
sns.kdeplot(electrodes[electrodes['PatientPosition']=='Prone']['TPLE'], label='Prone',  color='y')


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a20b8c9e8>

It looks like their distributions come from different samples


In [10]:
print('TPLE Mean Supine: {} (SD {}) \nTPLE Mean Prone: {} (SD {})'.format(
    round(electrodes[electrodes['PatientPosition']=='Supine']['TPLE'].mean(),2), round(electrodes[electrodes['PatientPosition']=='Supine']['TPLE'].std(),2), 
    round(electrodes[electrodes['PatientPosition']=='Prone']['TPLE'].mean(),2), round(electrodes[electrodes['PatientPosition']=='Prone']['TPLE'].std(),2)))


TPLE Mean Supine: 3.45 (SD 2.88) 
TPLE Mean Prone: 3.92 (SD 2.75)

Let's prove if indeed are independet samples with a t-test. If we observe a large p-value, then we cannot reject the null hypothesis of identical means. If the p-value is smaller than alpha, then we reject the null hypothesis of equal means.


In [11]:
alpha = 0.05
t_stat, p = stats.mstats.ttest_ind(electrodes[electrodes['PatientPosition']=='Supine']['TPLE'],
                                   electrodes[electrodes['PatientPosition']=='Prone']['TPLE'])
if p < alpha: 
    print("p= {} The null hypothesis can be rejected".format(p))
else:
    print("p= {} The null hypothesis cannot be rejected".format(p))


p= 0.04187565196970924 The null hypothesis can be rejected

So it means the two groups came from diferent populations, and we might trate them different

We do the same with electrode type


In [12]:
print('TPLE Mean Oblique: {} (SD {}) \nTPLE Mean Orthogonal: {} (SD {})'.format(
    round(electrodes[electrodes['ElectrodeType']=='Oblique']['TPLE'].mean(),2), round(electrodes[electrodes['ElectrodeType']=='Oblique']['TPLE'].std(),2), 
    round(electrodes[electrodes['ElectrodeType']=='Orthogonal']['TPLE'].mean(),2), round(electrodes[electrodes['ElectrodeType']=='Orthogonal']['TPLE'].std(),2)))


TPLE Mean Oblique: 3.44 (SD 2.93) 
TPLE Mean Orthogonal: 3.81 (SD 2.68)

Computing t-test.


In [13]:
alpha = 0.05
t_stat, p = stats.mstats.ttest_ind(electrodes[electrodes['ElectrodeType']=='Oblique']['TPLE'],
                                   electrodes[electrodes['ElectrodeType']=='Orthogonal']['TPLE'])
if p < alpha: 
    print("p= {} The null hypothesis can be rejected".format(p))
else:
    print("p= {} The null hypothesis cannot be rejected".format(p))


p= 0.07447724453227494 The null hypothesis cannot be rejected

In [14]:
#missing_values_table(electrodes)

In [15]:
#taking into account came from different distributions
df_supine = electrodes[electrodes['PatientPosition']=='Supine']
df_prone = electrodes[electrodes['PatientPosition']=='Prone']

Input missing variables

We start with the variable to input DuraTipDistancePlanned, but first take a look to the distribution


In [16]:
dura_sup = df_supine['DuraTipDistancePlanned'].dropna()
dura_pro = df_prone['DuraTipDistancePlanned'].dropna()

In [17]:
# normallity test
def normal_test(vector, alpha = 0.05):
    k2, p = stats.mstats.normaltest(vector)
    if p < alpha: # null hypothesis: x comes from a normal distribution
        print("p = {} The null hypothesis can be rejected".format(p))
    else: 
        print("p = {} The null hypothesis cannot be rejected".format(p))

In [18]:
#electrodes['DuraTipDistancePlanned'].hist()
plt.figure(figsize=[16, 8])
sns.distplot(dura_sup)
sns.distplot(dura_pro)
normal_test(dura_sup)
normal_test(dura_pro)


p = 5.988057054136884e-17 The null hypothesis can be rejected
p = 0.0005791646181393273 The null hypothesis can be rejected

They are normal distribuited so we might input variables with the statistics of the variable


In [19]:
def input_mean(df, variable):
    df_null = df[df[variable].isnull()]
    df_not_null = df[df[variable].notnull()]
    df_null[variable] = round(df_not_null[variable].mean(),1)
    inputed_df = df_null.append(df_not_null).sort_index()
    return inputed_df

In [20]:
df_supine = input_mean(df_supine, 'DuraTipDistancePlanned')

In [21]:
df_prone = input_mean(df_prone, 'DuraTipDistancePlanned')

We merge again


In [22]:
electrodes = df_supine.append(df_prone).sort_index()

In [23]:
#missing_values_table(electrodes)

Next variable to input is the number of contacts


In [24]:
#electrodes.insert(0, 'Index', electrodes.index)

In [25]:
contact_null = electrodes[electrodes['Contacts'].isnull()]
contact_not_null = electrodes[electrodes['Contacts'].notnull()]

In [26]:
contact_not_null['Contacts'] = contact_not_null['Contacts'].astype(int)

We need to convert into sframe


In [27]:
sf_contact_not_null = tc.SFrame(data=contact_not_null)

In [28]:
# features for contacts classifier
features = ['PatientPosition', 'ElectrodeType', 'PlanningX', 'PlanningY', 'PlanningZ', 'DuraTipDistancePlanned', 'SkinSkullDistance', 'SkullThickness', 'SkullAngle']

In [29]:
# We train a classifier
train_data, test_data = sf_contact_not_null.random_split(0.85)

clf1 = tc.classifier.create(train_data, target='Contacts', features = features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: BoostedTreesClassifier, RandomForestClassifier, DecisionTreeClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
Boosted trees classifier:
--------------------------------------------------------
Number of examples          : 658
Number of classes           : 6
Number of feature columns   : 9
Number of unpacked features : 9
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| Iteration | Elapsed Time | Training-accuracy | Validation-accuracy | Training-log_loss | Validation-log_loss |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| 1         | 0.025062     | 0.820669          | 0.545455            | 1.309959          | 1.496785            |
| 2         | 0.044466     | 0.851064          | 0.545455            | 1.053785          | 1.417049            |
| 3         | 0.063745     | 0.873860          | 0.545455            | 0.871295          | 1.364482            |
| 4         | 0.090400     | 0.886018          | 0.545455            | 0.736904          | 1.317326            |
| 5         | 0.111388     | 0.899696          | 0.545455            | 0.634599          | 1.318138            |
| 6         | 0.131019     | 0.914894          | 0.545455            | 0.545460          | 1.321431            |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
Random forest classifier:
--------------------------------------------------------
Number of examples          : 658
Number of classes           : 6
Number of feature columns   : 9
Number of unpacked features : 9
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| Iteration | Elapsed Time | Training-accuracy | Validation-accuracy | Training-log_loss | Validation-log_loss |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| 1         | 0.029227     | 0.727964          | 0.424242            | 0.919855          | 1.540483            |
| 2         | 0.045659     | 0.808511          | 0.515152            | 0.844761          | 1.434776            |
| 3         | 0.060199     | 0.834347          | 0.515152            | 0.800702          | 1.419205            |
| 4         | 0.074599     | 0.841945          | 0.515152            | 0.798519          | 1.411447            |
| 5         | 0.088015     | 0.855623          | 0.606061            | 0.776646          | 1.362726            |
| 6         | 0.101677     | 0.866261          | 0.575758            | 0.771824          | 1.355830            |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
Decision tree classifier:
--------------------------------------------------------
Number of examples          : 658
Number of classes           : 6
Number of feature columns   : 9
Number of unpacked features : 9
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| Iteration | Elapsed Time | Training-accuracy | Validation-accuracy | Training-log_loss | Validation-log_loss |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| 1         | 0.025007     | 0.820669          | 0.545455            | 0.666157          | 1.287582            |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
Logistic regression:
--------------------------------------------------------
Number of examples          : 658
Number of classes           : 6
Number of feature columns   : 9
Number of unpacked features : 9
Number of coefficients      : 50
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+-------------------+---------------------+
| Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
+-----------+----------+--------------+-------------------+---------------------+
| 1         | 2        | 0.003473     | 0.489362          | 0.333333            |
| 2         | 3        | 0.007049     | 0.536474          | 0.424242            |
| 3         | 4        | 0.011057     | 0.545593          | 0.454545            |
| 4         | 5        | 0.014755     | 0.553191          | 0.454545            |
| 5         | 6        | 0.018482     | 0.551672          | 0.454545            |
| 6         | 7        | 0.022079     | 0.551672          | 0.454545            |
+-----------+----------+--------------+-------------------+---------------------+
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: BoostedTreesClassifier          : 0.5454545021057129
PROGRESS: RandomForestClassifier          : 0.5454545021057129
PROGRESS: DecisionTreeClassifier          : 0.5454545021057129
PROGRESS: LogisticClassifier              : 0.454545
PROGRESS: ---------------------------------------------
PROGRESS: Selecting BoostedTreesClassifier based on validation set performance.
SUCCESS: Optimal solution found.


In [30]:
clf1


Out[30]:
Class                          : BoostedTreesClassifier

Schema
------
Number of examples             : 658
Number of feature columns      : 9
Number of unpacked features    : 9
Number of classes              : 6

Settings
--------
Number of trees                : 60
Max tree depth                 : 6
Training time (sec)            : 0.2123
Training accuracy              : 0.965
Validation accuracy            : 0.5455
Training log_loss              : 0.335
Validation log_loss            : 1.4106

In [31]:
#clf1.evaluate(test_data)
#clf1.evaluate(train_data)

In [32]:
#now convert the rest of the data
sf_contact_null = tc.SFrame(data=contact_null)
new_contacts = clf1.predict(sf_contact_null[features])
contact_null['Contacts'] = new_contacts

New electrodes table with inputed values


In [33]:
electrodes = contact_null.append(contact_not_null).sort_index()

In [34]:
#missing_values_table(electrodes)
# No missing values anymore

How the categories look like on the two different groups?


In [35]:
contacts_supine = df_supine.groupby('Contacts').count()['TPLE']
contacts_prone = df_prone.groupby('Contacts').count()['TPLE']

In [36]:
plt.figure(figsize=[20, 8])
plt.subplot(1,2,1); plt.xticks(contacts_supine.index)
plt.bar(contacts_supine.index, contacts_supine)
plt.subplot(1,2,2), plt.xticks(contacts_prone.index)
plt.bar(contacts_prone.index, contacts_prone, color='y')


Out[36]:
<BarContainer object of 6 artists>

Target Variables

Now we will remove large outliers (difference between planned and real coord) in the Z-axis as electrode insertion length (depth) is influenced also by other factors (calculations regarding depth which could lead to either too superficial or too deep, but also possible malfixation of the screw cap which may cause loosening of the electrode and hence a more superficial position.. it won't migrate into the depth spontaneously). These are very limited numbers and would too much influence further analysis.


In [37]:
plt.figure(figsize=[16, 8])
coordinates = ['PlanningX','PlanningY','PlanningZ','TipX','TipY','TipZ']
for c in coordinates:
    sns.kdeplot(electrodes[str(c)], label=str(c))


Every Axis follows a particular distribution, let's try a normal test


In [38]:
for c in coordinates:
    normal_test(electrodes[str(c)])


p = 6.9465819854989e-84 The null hypothesis can be rejected
p = 1.882265752281482e-07 The null hypothesis can be rejected
p = 0.0002928270371526327 The null hypothesis can be rejected
p = 1.404931961375195e-62 The null hypothesis can be rejected
p = 1.9081660369783106e-06 The null hypothesis can be rejected
p = 0.00010052998936429912 The null hypothesis can be rejected

Not normal distribuited


In [39]:
electrodes.head()


Out[39]:
PatientPosition Contacts ElectrodeType PlanningX PlanningY PlanningZ DuraTipDistancePlanned TipX TipY TipZ SkinSkullDistance SkullThickness SkullAngle TPLE
0 Supine 18 Oblique 125.8 106.5 135.5 86.6 126.4 106.8 135.2 7.0 9.7 70.3 0.7
1 Supine 18 Oblique 130.6 131.0 136.4 105.9 134.7 132.9 136.0 9.4 7.4 63.8 4.5
2 Supine 10 Oblique 139.1 104.9 124.0 35.2 139.1 108.3 124.4 9.8 5.5 66.4 3.4
3 Supine 10 Oblique 137.7 112.2 115.0 35.9 136.4 115.2 115.0 8.4 6.5 66.3 3.3
4 Supine 12 Oblique 126.0 76.0 124.1 39.5 124.6 75.8 126.8 7.4 4.6 84.7 3.0

We need to structure our data properly for further analysis and convert the categorical variables (nominal, ordinal) to the category type.


Regression Model to Predict Coordinates


In [40]:
electrodes.drop('TPLE', axis = 1, inplace = True)

In [41]:
#electrodes.insert(0, 'Index', range(0, len(electrodes), 1))
sf_electrodes = tc.SFrame(data=electrodes)

In [42]:
sf_electrodes.head()


Out[42]:
PatientPosition Contacts ElectrodeType PlanningX PlanningY PlanningZ DuraTipDistancePlanned TipX
Supine 18 Oblique 125.8 106.5 135.5 86.6 126.4
Supine 18 Oblique 130.6 131.0 136.4 105.9 134.7
Supine 10 Oblique 139.1 104.9 124.0 35.2 139.1
Supine 10 Oblique 137.7 112.2 115.0 35.9 136.4
Supine 12 Oblique 126.0 76.0 124.1 39.5 124.6
Supine 15 Oblique 134.8 121.8 119.8 55.3 134.2
Supine 15 Oblique 131.8 117.3 118.4 87.9 133.5
Supine 18 Oblique 66.4 123.9 116.4 74.1 67.4
Supine 10 Oblique 146.0 58.0 109.9 32.6 150.1
Supine 10 Orthogonal 131.1 85.4 108.8 49.3 127.8
TipY TipZ SkinSkullDistance SkullThickness SkullAngle
106.8 135.2 7.0 9.7 70.3
132.9 136.0 9.4 7.4 63.8
108.3 124.4 9.8 5.5 66.4
115.2 115.0 8.4 6.5 66.3
75.8 126.8 7.4 4.6 84.7
122.6 121.3 6.9 11.1 67.0
117.9 116.9 8.0 9.9 72.8
127.5 114.4 8.9 11.5 61.2
59.9 111.4 13.0 9.6 50.1
84.6 108.8 8.5 6.5 82.6
[10 rows x 13 columns]

In [43]:
sf_electrodes.num_rows()


Out[43]:
866

In [44]:
features_train = ['PatientPosition',
 'Contacts',
 'ElectrodeType',
 'PlanningX',
 'PlanningY',
 'PlanningZ',
 'DuraTipDistancePlanned','SkinSkullDistance',
 'SkullThickness',
 'SkullAngle']

In [45]:
targets_train = ['TipX', 'TipY', 'TipZ']

In [46]:
train_data, test_data = sf_electrodes.random_split(0.8)

In [47]:
def multi_boosted_tree(dataset, features, targets):
    models = []
    for target in targets:
        reg = tc.boosted_trees_regression.create(dataset, target=target, features = features, verbose=False)
        models.append(reg)
    return models

In [48]:
MBT = multi_boosted_tree(train_data, features_train, targets_train)

In [49]:
MBT


Out[49]:
[Class                          : BoostedTreesRegression
 
 Schema
 ------
 Number of examples             : 655
 Number of feature columns      : 10
 Number of unpacked features    : 10
 
 Settings
 --------
 Number of trees                : 10
 Max tree depth                 : 6
 Training time (sec)            : 0.0321
 Training rmse                  : 3.5981
 Validation rmse                : 4.3997
 Training max_error             : 14.8011
 Validation max_error           : 11.5872,
 Class                          : BoostedTreesRegression
 
 Schema
 ------
 Number of examples             : 660
 Number of feature columns      : 10
 Number of unpacked features    : 10
 
 Settings
 --------
 Number of trees                : 10
 Max tree depth                 : 6
 Training time (sec)            : 0.0254
 Training rmse                  : 3.6595
 Validation rmse                : 5.0908
 Training max_error             : 18.2009
 Validation max_error           : 14.4441,
 Class                          : BoostedTreesRegression
 
 Schema
 ------
 Number of examples             : 646
 Number of feature columns      : 10
 Number of unpacked features    : 10
 
 Settings
 --------
 Number of trees                : 10
 Max tree depth                 : 6
 Training time (sec)            : 0.0232
 Training rmse                  : 3.6027
 Validation rmse                : 3.9878
 Training max_error             : 18.9048
 Validation max_error           : 13.8965]

In [50]:
MBT[1].get_feature_importance()


Out[50]:
name index count
PlanningY None 78
PlanningX None 17
DuraTipDistancePlanned None 11
PlanningZ None 8
SkullThickness None 7
Contacts None 6
SkullAngle None 5
SkinSkullDistance None 5
ElectrodeType Oblique 2
PatientPosition Supine 1
[12 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [51]:
# Evaluate the model and save the results into a dictionary
#results = 
for i in range(3):
    print(MBT[i-1].evaluate(test_data))


{'max_error': 17.067489624023438, 'rmse': 4.474193676667111}
{'max_error': 16.47852325439453, 'rmse': 3.749562618801999}
{'max_error': 20.500869750976562, 'rmse': 4.282052307036403}

TPLE Comparsison with MBT


In [52]:
test_data['PredictX'] = [round(i,3) for i in MBT[0].predict(test_data)]
test_data['PredictY'] = [round(i,3) for i in MBT[1].predict(test_data)]
test_data['PredictZ'] = [round(i,3) for i in MBT[2].predict(test_data)]

In [53]:
test_data['TPLE'] = np.sqrt(np.square(test_data['TipX'] - test_data['PlanningX']) + 
                              np.square(test_data['TipY'] - test_data['PlanningY']) + 
                              np.square(test_data['TipZ'] - test_data['PlanningZ'])
                             ).round(1)

test_data['TPLE_Pred'] = np.sqrt(np.square(test_data['PredictX'] - test_data['PlanningX']) + 
                              np.square(test_data['PredictY'] - test_data['PlanningY']) + 
                              np.square(test_data['PredictZ'] - test_data['PlanningZ'])
                             ).round(1)

In [54]:
test_data


Out[54]:
PatientPosition Contacts ElectrodeType PlanningX PlanningY PlanningZ DuraTipDistancePlanned TipX
Supine 18 Oblique 125.8 106.5 135.5 86.6 126.4
Supine 8 Orthogonal 65.7 112.6 99.9 40.7 66.5
Supine 8 Orthogonal 67.4 119.8 113.3 40.3 68.3
Supine 15 Orthogonal 85.9 57.8 90.6 58.8 86.3
Supine 15 Oblique 59.5 87.7 64.4 23.4 58.7
Supine 12 Oblique 131.5 89.4 77.9 45.2 130.4
Prone 18 Oblique 73.4 101.0 134.2 86.9 71.5
Prone 18 Oblique 55.2 104.9 150.1 100.1 53.8
Prone 8 Oblique 65.8 97.2 111.8 54.4 64.2
Supine 8 Oblique 77.0 121.6 75.7 28.2 76.6
TipY TipZ SkinSkullDistance SkullThickness SkullAngle PredictX PredictY PredictZ TPLE TPLE_Pred
106.8 135.2 7.0 9.7 70.3 124.372 103.53 132.015 0.7 4.8
116.5 98.7 7.8 6.0 81.1 64.963 109.796 98.047 4.2 3.4
124.7 114.2 9.7 4.8 75.0 66.017 120.13 110.395 5.1 3.2
56.3 88.5 7.3 5.0 74.3 85.268 55.077 86.853 2.6 4.7
87.8 63.7 8.0 11.2 67.9 58.403 85.078 64.483 1.1 2.8
89.7 78.7 3.8 4.0 70.0 125.649 85.699 76.604 1.4 7.0
100.0 135.7 7.6 8.8 63.4 69.97 97.498 127.949 2.6 7.9
103.1 152.4 8.3 8.3 58.1 46.963 101.702 138.595 3.2 14.5
94.9 114.3 7.7 6.5 53.9 63.215 94.081 111.96 3.8 4.1
120.7 76.4 5.0 7.0 74.7 75.512 120.13 75.303 1.2 2.1
[180 rows x 18 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [91]:
#test_data['TPLE'].show()

In [92]:
#test_data['TPLE_Pred'].show()

In [57]:
rmse_1 = tc.evaluation.rmse(test_data['TPLE'], test_data['TPLE_Pred'])
print('Root Mean Squared Error for Boosted Tree Regression \n{}'.format(rmse_1))


Root Mean Squared Error for Boosted Tree Regression 
3.6

Point Estimator


In [58]:
td_supine_tple = test_data[test_data['PatientPosition']=='Supine']['TPLE']
td_prone_tple = test_data[test_data['PatientPosition']=='Prone']['TPLE']

In [59]:
print('TPLE Mean Supine: {} (SD {}) \nTPLE Mean Prone: {} (SD {})'.format(
    round(td_supine_tple.mean(),2), round(td_supine_tple.std(),2), 
    round(td_prone_tple.mean(),2), round(td_prone_tple.std(),2)))


TPLE Mean Supine: 3.32 (SD 2.0) 
TPLE Mean Prone: 4.28 (SD 3.17)

In [60]:
estimators = []
for i in range(len(test_data)):
    if test_data['PatientPosition'][i] == 'Supine':
        estimators.append(round(td_supine_tple.mean(),2))
    else:
        estimators.append(round(td_prone_tple.mean(),2))

In [61]:
test_data['TPLE_point'] = estimators

In [62]:
rmse_2 = tc.evaluation.rmse(test_data['TPLE'], test_data['TPLE_point'])
print('Root Mean Squared Error for Point Estimator \n{}'.format(rmse_2))


Root Mean Squared Error for Point Estimator 
2.337952570567295

In [63]:
#test_data.print_rows(172,19)

Deep learning

As a quick glance to deep learning we will apply a Multilayer Perceptron (MLP) for multi-class softmax classification using stochastic gradient descent as an optimizer. The code is borrowed from the official keras Sequential model guide and adapted where needed.


In [64]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [65]:
#First some preprocessing
X = electrodes[features_train].values
#column 0 and 2 are categorical
y = electrodes[targets_train].values

In [66]:
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=80)

In [68]:
sc = StandardScaler()
X_trainS = sc.fit_transform(X_train)
X_testS = sc.transform(X_test)

In [89]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from keras.utils.np_utils import to_categorical
from keras.callbacks import ModelCheckpoint

In [70]:
batch_size = 2
epochs = 100

In [71]:
model = Sequential()
model.add(Dense(output_dim = 60, activation = 'relu', input_dim = X_train.shape[1])) #(10,)
model.add(Dense(output_dim = 60, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(output_dim = 3, activation = 'linear')) # dim output (3, ) last layer should have linear activations like no activators

In [72]:
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 60)                660       
_________________________________________________________________
dense_2 (Dense)              (None, 60)                3660      
_________________________________________________________________
dropout_1 (Dropout)          (None, 60)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 183       
=================================================================
Total params: 4,503
Trainable params: 4,503
Non-trainable params: 0
_________________________________________________________________

In [73]:
#sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer = 'adam', loss = 'mean_squared_error', metrics = ['accuracy']) #mse

In [74]:
#epoch number, 2 decimals - valaccuracy 2 float
filepath = "../models/weights-nn-{epoch:02d}-{val_acc:.2f}.h5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=0, mode='max')

model.fit(X_trainS, y_train,
          batch_size=batch_size, verbose=0,
          epochs=epochs,
          validation_data=(X_testS, y_test),
          callbacks=[checkpoint])


Out[74]:
<keras.callbacks.History at 0x1a3b09ee80>

In [76]:
scores = model.evaluate(X_testS, y_test, batch_size=2)
print('\n\nrmse: {:.2f}\n\n{}: {:.2f}'.format(np.sqrt(scores[0]), model.metrics_names[1], scores[1]))


174/174 [==============================] - 0s 433us/step


rmse: 7.27

acc: 0.95

In [77]:
y_pred = model.predict(X_testS)

We are predicting the real carteasian points, so we want to calculate a "real TPLE" before the operation


TPLE Comparsison for NN


In [78]:
#feature_idx, feature_name = [], []
#features = []
#for idx, i in enumerate(electrodes.columns):
#    features.append((idx,i))
#    feature_idx.append(idx)
#    feature_name.append(i)

In [79]:
list_coor = ['PlanningX','PlanningY','PlanningZ','TipX','TipY','TipZ']
X = electrodes[list_coor].values #trying the TPLE calculation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=80)

In [80]:
coordenates = pd.DataFrame() 
coordenates['PlanningX'] = [i for i in X_test[:,0]]
coordenates['PlanningY'] = [i for i in X_test[:,1]]
coordenates['PlanningZ'] = [i for i in X_test[:,2]]
coordenates['TipX'] = [i for i in X_test[:,3]]
coordenates['TipY'] = [i for i in X_test[:,4]]
coordenates['TipZ'] = [i for i in X_test[:,5]]
coordenates['PredictX'] = [y_pred[i][0] for i in range(len(y_pred))]
coordenates['PredictY'] = [y_pred[i][1] for i in range(len(y_pred))]
coordenates['PredictZ'] = [y_pred[i][2] for i in range(len(y_pred))]

In [81]:
coordenates.head()


Out[81]:
PlanningX PlanningY PlanningZ TipX TipY TipZ PredictX PredictY PredictZ
0 92.6 54.1 110.0 94.2 45.7 110.0 84.341675 56.625088 102.770157
1 94.8 116.1 70.3 95.7 115.9 68.8 88.230461 105.434898 67.722328
2 68.2 95.9 67.8 71.6 93.8 69.8 63.377934 84.028664 64.539787
3 62.7 87.8 136.9 61.9 87.2 138.7 66.507332 87.694168 125.087517
4 131.7 83.8 65.8 131.1 82.5 64.6 121.721565 78.528969 69.191719

In [82]:
coordenates['TPLE'] = np.sqrt(np.square(coordenates['TipX'] - coordenates['PlanningX']) + 
                              np.square(coordenates['TipY'] - coordenates['PlanningY']) + 
                              np.square(coordenates['TipZ'] - coordenates['PlanningZ'])
                             ).round(1)

coordenates['TPLE_Pred'] = np.sqrt(np.square(coordenates['PredictX'] - coordenates['PlanningX']) + 
                              np.square(coordenates['PredictY'] - coordenates['PlanningY']) + 
                              np.square(coordenates['PredictZ'] - coordenates['PlanningZ'])
                             ).round(1)

In [83]:
coordenates.head(10)


Out[83]:
PlanningX PlanningY PlanningZ TipX TipY TipZ PredictX PredictY PredictZ TPLE TPLE_Pred
0 92.6 54.1 110.0 94.2 45.7 110.0 84.341675 56.625088 102.770157 8.6 11.3
1 94.8 116.1 70.3 95.7 115.9 68.8 88.230461 105.434898 67.722328 1.8 12.8
2 68.2 95.9 67.8 71.6 93.8 69.8 63.377934 84.028664 64.539787 4.5 13.2
3 62.7 87.8 136.9 61.9 87.2 138.7 66.507332 87.694168 125.087517 2.1 12.4
4 131.7 83.8 65.8 131.1 82.5 64.6 121.721565 78.528969 69.191719 1.9 11.8
5 61.1 97.0 105.2 62.1 96.5 106.5 55.112144 89.479988 99.073944 1.7 11.4
6 71.2 66.6 67.1 70.1 65.4 69.0 65.362091 56.500767 62.922848 2.5 12.4
7 69.5 98.1 112.4 72.1 99.6 113.9 64.195015 89.421608 100.860748 3.4 15.4
8 70.0 127.9 114.8 68.6 124.4 113.8 67.521965 116.666176 108.307312 3.9 13.2
9 95.6 40.5 84.4 85.1 43.4 82.1 92.237053 39.854797 79.850189 11.1 5.7

In [84]:
sf_coordenates = tc.SFrame(data=coordenates)

In [85]:
rmse_3 = tc.evaluation.rmse(sf_coordenates['TPLE'], sf_coordenates['TPLE_Pred'])
print('Root Mean Squared Error for Deep Learning \n{}'.format(rmse_3))


Root Mean Squared Error for Deep Learning 
8.749574702307786

Preliminary Results


In [86]:
rmseS = [rmse_1, rmse_2, rmse_3]
rsme_accuracy = [round(1 - (i/max(coordenates['TPLE'])),4) for i in rmseS]
rsme_accuracy


Out[86]:
[0.8182, 0.8819, 0.5581]

In [88]:
#rmse the smaller the better
y=rsme_accuracy#[round(rmse_1,1),round(rmse_2,1),round(rmse_3,1)]; 
x=list(range(len(y)))
top = [i+0.1 for i in y]; bot = [i-0.1 for i in y]; inter = list(zip(bot,top))
plt.figure(figsize=(16,8))
plt.errorbar(x,y,yerr=[(top-bot)/2 for top,bot in inter], fmt='o', label='Accuracy', color='red', linewidth=3)
plt.plot(x,y, color='black'); plt.legend(); plt.title('Accuracy of RMSE for TPLE')
plt.text(x[0]+.05, y[0]-.01, 'Multi-outupt Boosted Trees')
plt.text(x[1]+.05, y[1]+.01, 'Group Point Estimate')
plt.text(x[2]-.2, y[2]-.01, 'Deep Learning')


Out[88]:
Text(1.8,0.5481,'Deep Learning')

In [ ]: